==========================================
NCHLT: Text OCR Engines
==========================================
Contents:
------------------------------------------

	1. Introduction
		1.1 Licenses	
	2. Required files		
	3. Installing
		3.1 Installing Tesseract
		3.2 Installing language models		
	4. Using Tesseract
	
__________________________________________
1. Introduction:
------------------------------------------
An OCR engine is an application that enables a user to convert images of 
scanned documents into editable and searchable electronic texts.

The supplied OCR models form part of the NCHLT: Text III project and can convert 
scanned documents (image files), in one of ten South African languages, to text. 
An English OCR model is also included in the default Tesseract installation.

The OCR models were developed for use with the Tesseract command-line application.

------------------------------------------
1.1 Licenses:
------------------------------------------

Tesseract:

	./LICENSE-Tesseract.txt


NCHLT OCR (language models):

	./LICENSE-NCHLT-OCR.txt
	
__________________________________________
2. Required files:
------------------------------------------

The installation file for Tesseract:

	./tesseract-ocr-setup-3.02.02.exe
	
The installation file for language models:

	./nchlt-ocr-setup.exe

__________________________________________
3. Installing:
------------------------------------------

Tesseract must be installed before the language models can be installed.
Make sure you complete step 3.1 BEFORE proceeding to step 3.2.

------------------------------------------
3.1 Installing Tesseract (required):
------------------------------------------

Run "tesseract-ocr-setup-3.02.02.exe" by double-clicking it *.
Follow the instructions to complete the installation.
Please ensure that Tesseract is installed in its default location,
i.e. "C:\Program Files (x86)\Tesseract-OCR" on 64-bit systems or
"C:\Program Files\Tesseract-OCR" on 32-bit systems.

* If you are not using an administrator account, right-click the file and
select "Run as administrator" to install Tesseract to "C:\Program Files".


------------------------------------------
3.2 Installing language models (required):
------------------------------------------

Run "nchlt-ocr-setup.exe" by double-clicking it *.
Follow the instructions to complete the installation.
Please ensure that the language models are installed in the default location
of Tesseract, i.e. "C:\Program Files (x86)\Tesseract-OCR" on 64-bit systems or
"C:\Program Files\Tesseract-OCR" on 32-bit systems.

* If you are not using an administrator account, right-click the file and
select "Run as administrator" to install the models to "C:\Program Files".
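
The installation can be checked with a short script. The following Python
sketch is not part of the supplied files. It looks for the two default
locations named above and reports which language models appear to be missing.
The function names are hypothetical, and the layout
"tessdata\<code>.traineddata" is an assumption based on the usual Tesseract
convention; this README does not state where the installer places the models.

```python
import ntpath  # Windows path rules, regardless of where the script runs
import os

# Default install locations named in section 3 of this README.
DEFAULT_DIRS = [
    r"C:\Program Files (x86)\Tesseract-OCR",
    r"C:\Program Files\Tesseract-OCR",
]


def find_tesseract_dir(candidates=DEFAULT_DIRS, is_dir=os.path.isdir):
    """Return the first candidate directory that exists, or None."""
    for path in candidates:
        if is_dir(path):
            return path
    return None


def missing_models(install_dir, codes, is_file=os.path.isfile):
    """Return the language codes whose model file was not found.

    ASSUMPTION: each model is stored as tessdata\\<code>.traineddata
    inside the install directory.
    """
    missing = []
    for code in codes:
        model = ntpath.join(install_dir, "tessdata", code + ".traineddata")
        if not is_file(model):
            missing.append(code)
    return missing


if __name__ == "__main__":
    install_dir = find_tesseract_dir()
    if install_dir is None:
        print("Tesseract was not found in a default location.")
    else:
        print("Found:", install_dir)
        print("Missing models:", missing_models(install_dir, ["af", "zu"]))
```

The "is_dir" and "is_file" parameters exist only so that the checks can be
tried without a real installation.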

__________________________________________
4. Using Tesseract:
------------------------------------------

	1. Open a new command prompt.
	2. Navigate to the directory that contains the images you want to convert to text.
	3. Run the following command:
		tesseract [ImageFileName] [TextOutputFileName] -l [Language]
	4. The text file, [TextOutputFileName].txt, is created in the same directory as the input file.
		
		Arguments:
		ImageFileName			-	The file name of the JPG image to convert, including the extension
		TextOutputFileName		-	The file name of the text output, without an extension (".txt" is added automatically)
		Language				-	The language code of the language model to use (see the list below)
		
		Language codes: 
		
		af	-	Afrikaans
		nr	-	isiNdebele
		nso	-	Sepedi
		ss	-	siSwati
		st	-	Sesotho
		tn	-	Setswana
		ts	-	Xitsonga
		ve	-	Tshivenda
		xh	-	isiXhosa
		zu	-	isiZulu
		
		To convert an English image, omit the language option (-l), i.e.:
		tesseract [ImageFileName] [TextOutputFileName]
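
The command format and language codes above can also be captured in a small
helper. This Python sketch is an illustration only and is not part of the
supplied files. "LANGUAGE_CODES" and "build_command" are hypothetical names;
the codes themselves are copied from the list above.

```python
# Language codes listed in this README. English needs no code because it
# is the model Tesseract uses when the -l option is omitted.
LANGUAGE_CODES = {
    "Afrikaans": "af",
    "isiNdebele": "nr",
    "Sepedi": "nso",
    "siSwati": "ss",
    "Sesotho": "st",
    "Setswana": "tn",
    "Xitsonga": "ts",
    "Tshivenda": "ve",
    "isiXhosa": "xh",
    "isiZulu": "zu",
}


def build_command(image_file, output_base, language=None):
    """Return the tesseract argument list for one image.

    language is a name from LANGUAGE_CODES; None or "English" omits -l.
    """
    command = ["tesseract", image_file, output_base]
    if language is not None and language != "English":
        if language not in LANGUAGE_CODES:
            raise ValueError("No language model listed for: " + language)
        command += ["-l", LANGUAGE_CODES[language]]
    return command


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(" ".join(build_command("page.jpg", "page", "Tshivenda")))
```

Rejecting unknown language names early avoids passing a code to Tesseract for
which no model was installed.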
		
		Examples:
		
		tesseract page.jpg page -l ve
			This will convert the Tshivenda image "page.jpg" to a text file, "page.txt". 
		tesseract scan.jpg scan -l nso
			This will convert the Sepedi image "scan.jpg" to a text file, "scan.txt".
		tesseract page.jpg page
			This will convert the English image "page.jpg" to a text file, "page.txt".
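
To convert a whole directory of images, the single-image command can be
repeated for each file. The Python sketch below is one possible way to do this
and is not part of the supplied files. "batch_commands" and
"convert_directory" are hypothetical names, and the sketch assumes that the
"tesseract" command can be found on the PATH.

```python
import os
import subprocess


def batch_commands(file_names, language_code=None):
    """Build one tesseract command per .jpg file name.

    The output name is the image name without its extension, as in the
    examples above. Files with other extensions are skipped.
    """
    commands = []
    for name in sorted(file_names):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".jpg":
            continue
        command = ["tesseract", name, base]
        if language_code:
            command += ["-l", language_code]
        commands.append(command)
    return commands


def convert_directory(directory, language_code=None):
    """Run tesseract on every .jpg file in the given directory.

    ASSUMPTION: "tesseract" is on the PATH. Each command runs with the
    image directory as its working directory, so the text files are
    written next to the images.
    """
    for command in batch_commands(os.listdir(directory), language_code):
        subprocess.run(command, cwd=directory, check=True)


if __name__ == "__main__":
    # Show the commands for a few example names without running them.
    for command in batch_commands(["page.jpg", "scan.jpg"], "zu"):
        print(" ".join(command))
```

Calling convert_directory with a directory path and "zu" would then convert
every isiZulu image in that directory; leaving out the language code uses the
default English model.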
